Below is a description of our R environment.
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 tools_3.3.2
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.10 stringi_1.1.2
## [9] rmarkdown_1.3 knitr_1.15.1 stringr_1.1.0 digest_0.6.11
## [13] evaluate_0.10
The data sets used in this project were government issued data (www.data.gov) and census data retrieved from data.world.
The first data set includes information about every registered higher education institution in the United States (and US territories) from 2003 to 2013. Some of the information relevant to our project includes location (state), average SAT data, demographic information, number of undergraduates, and tuition information. The second data set was taken from the 2010 census data. It includes the population of each state. The third data set, also taken from 2010 census data, includes information on the number of residents over 25, high school graduates, and Bachelors degree holders in each state.
The final dataset was obtained from the Center for Disease Control and is entitled adult_obese in the data.world repository. The data contains obesity rates as percent of the adult and adolescent populations for all 50 states. It further includes confidence intervals and sample sizes for each state.
The “dirty” version of our data can be accessed from data.world using the following R script.
source("../01_Data/data.world.R")
## Loading required package: jsonlite
## Loading required package: RCurl
## Loading required package: bitops
summary(df)
## Year UNITID
## Min. :2003 Min. : 100654
## 1st Qu.:2005 1st Qu.: 164872
## Median :2008 Median : 214962
## Mean :2008 Mean : 966061
## 3rd Qu.:2011 3rd Qu.: 418029
## Max. :2013 Max. :48285707
##
## F1SYSNAM
## NULL :38470
## #N/A : 6745
## ITT Educational Services Inc. : 1161
## Corinthian Colleges Inc. : 969
## EEG Inc. : 909
## Education Management Corporation: 899
## (Other) :29464
## INSTNM CITY STABBR
## ITT TECHNICAL INSTITUTE : 151 New York : 800 CA : 7744
## Marinello School of Beauty : 105 Chicago : 726 NY : 4981
## Lincoln Technical Institute: 84 Houston : 656 TX : 4564
## ITT Technical Institute : 79 Los Angeles: 555 PA : 4475
## Sanford-Brown Institute : 66 Brooklyn : 454 FL : 4022
## Southwestern College : 59 Miami : 441 OH : 3752
## (Other) :78073 (Other) :74985 (Other):49079
## ZIP main NUMBRANCH ADM_RATE_ALL
## 90010 : 73 Min. :0.0000 Min. : 1.000 NULL :47168
## 7306 : 67 1st Qu.:1.0000 1st Qu.: 1.000 1 : 2843
## 2116 : 66 Median :1.0000 Median : 1.000 0.816 : 149
## 85021 : 62 Mean :0.7858 Mean : 4.821 0.7359 : 132
## 11201 : 61 3rd Qu.:1.0000 3rd Qu.: 2.000 0.3449 : 122
## 23462 : 61 Max. :1.0000 Max. :141.000 0.5 : 82
## (Other):78227 (Other):28121
## ACTCMMID SAT_AVG_ALL UGDS UGDS_WHITE
## NULL :64564 NULL :61498 NULL : 5806 NULL :37453
## 22 : 2149 1010 : 371 35 : 218 0 : 9611
## 23 : 1923 1030 : 350 56 : 207 1 : 560
## 21 : 1917 970 : 332 44 : 202 0.5 : 85
## 24 : 1483 1050 : 327 40 : 196 0.8 : 69
## 20 : 1281 990 : 307 55 : 192 0.6667 : 68
## (Other): 5300 (Other):15432 (Other):71796 (Other):30771
## UGDS_BLACK UGDS_HISP UGDS_ASIAN UGDS_OTHER
## NULL :37453 NULL :37453 NULL :37453 NULL :37453
## 0 :11021 0 :11505 0 :16029 0 : 5491
## 1 : 138 1 : 725 0.0078 : 116 1 : 873
## 0.0714 : 70 0.0625 : 66 0.0044 : 114 0.2 : 49
## 0.0588 : 58 0.0278 : 63 0.0061 : 109 0.5 : 43
## 0.1429 : 58 0.0455 : 63 0.0071 : 108 0.1 : 42
## (Other):29819 (Other):28742 (Other):24688 (Other):34666
## TUITIONFEE_IN TUITIONFEE_OUT
## NULL :31829 NULL :33010
## 18048 : 530 18048 : 530
## 2700 : 116 17148 : 107
## 17148 : 106 15600 : 104
## 15600 : 104 9000 : 104
## 15495 : 100 14880 : 101
## (Other):45832 (Other):44661
To make our data easier to understand and work with, we created the following ETL (extract-translate-load) script to edit our CSV file containing higher education institution data using the steps below.
After cloning our repo, we set our working directory to the “00_Doc” folder.
We then downloaded and saved our dataset from data.world to the “01_Data” folder. We renamed the file to “PreETL_merged_university_data”, and created a file path leading to this file so that it could be read into RStudio as a data frame. (See the next section for instructions on how to download data files from data.world)
We looked through the data set and determined that the column names needed to be more descriptive. So we looked up what each of the variable names represented and renamed the variables so that they would be more descriptive. We did this by using the rename function from the plyr package.
For consistency, we decided to alter the zip codes so that all zip codes would only contain 5 digits. We did this by using the substring function.
Lastly, we changed all of the “NULL” values to read “NA in order to be more user friendly. This was accomplished by using the lapply and gsub functions.
We saved a copy of our “clean” CSV file by removing “PreETL_” from its name, and uploaded our clean dataset back to data.world.
The code used to accomplish these steps can be found in detail in the following R script. Below is a sample of our “clean” CSV file.
source("../01_Data/ETL.Final.R")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Year = col_integer(),
## UNITID = col_integer(),
## main = col_integer(),
## NUMBRANCH = col_integer()
## )
## See spec(...) for full column specifications.
summary(df)
## Year Unit_ID System_Name
## 2013 : 7804 100654 : 11 #N/A : 6745
## 2012 : 7793 100663 : 11 ITT Educational Services Inc. : 1161
## 2011 : 7675 100690 : 11 Corinthian Colleges Inc. : 969
## 2010 : 7414 100706 : 11 EEG Inc. : 909
## 2009 : 7149 100724 : 11 Education Management Corporation: 899
## 2008 : 6975 100751 : 11 (Other) :29464
## (Other):33807 (Other):78551 NA's :38470
## Institution_Name City State
## ITT TECHNICAL INSTITUTE : 151 New York : 800 CA : 7744
## Marinello School of Beauty : 105 Chicago : 726 NY : 4981
## Lincoln Technical Institute: 84 Houston : 656 TX : 4564
## ITT Technical Institute : 79 Los Angeles: 555 PA : 4475
## Sanford-Brown Institute : 66 Brooklyn : 454 FL : 4022
## Southwestern College : 59 Miami : 441 OH : 3752
## (Other) :78073 (Other) :74985 (Other):49079
## Zip_Code Main Number_of_Branches Admissions_Rate
## Length:78617 0:16842 1 :55705 1 : 2843
## Class :character 1:61775 2 : 6871 0.816 : 149
## Mode :character 3 : 3806 0.7359 : 132
## 4 : 2374 0.3449 : 122
## 5 : 1510 0.5 : 82
## 6 : 1130 (Other):28121
## (Other): 7221 NA's :47168
## Avg_ACT_Score Avg_SAT_Score Total_Undergraduates Percent_White
## 22 : 2149 1010 : 371 35 : 218 0 : 9611
## 23 : 1923 1030 : 350 56 : 207 1 : 560
## 21 : 1917 970 : 332 44 : 202 0.5 : 85
## 24 : 1483 1050 : 327 40 : 196 0.8 : 69
## 20 : 1281 990 : 307 55 : 192 0.6667 : 68
## (Other): 5300 (Other):15432 (Other):71796 (Other):30771
## NA's :64564 NA's :61498 NA's : 5806 NA's :37453
## Percent_Black Percent_Hispanic Percent_Asian Percent_Other
## 0 :11021 0 :11505 0 :16029 0 : 5491
## 1 : 138 1 : 725 0.0078 : 116 1 : 873
## 0.0714 : 70 0.0625 : 66 0.0044 : 114 0.2 : 49
## 0.0588 : 58 0.0278 : 63 0.0061 : 109 0.5 : 43
## 0.1429 : 58 0.0455 : 63 0.0071 : 108 0.1 : 42
## (Other):29819 (Other):28742 (Other):24688 (Other):34666
## NA's :37453 NA's :37453 NA's :37453 NA's :37453
## In-State_Tuition Out-State_Tuition
## 18048 : 530 18048 : 530
## 2700 : 116 17148 : 107
## 17148 : 106 15600 : 104
## 15600 : 104 9000 : 104
## 15495 : 100 14880 : 101
## (Other):45832 (Other):44661
## NA's :31829 NA's :33010
Once we cleaned up our data, we uploaded a new copy of it to data.world. To import this csv file for use in Tableau:
On the Tableau homescreen, under “To a Server” click “Web Data Connector” on the left-hand Connect menu.
For the URL, type “https://tableau.data.world” and press Enter.
For the Dataset, type “mamilloy/s-17-dv-project-6”.
Copy your personal data.world API Token into the next field and click “Get Dataset”.
All csv files in the above dataset will appear on the left-hand side of the screen to be used.
Our final Shiny application can be accessed at the following address: https://mamilloy.shinyapps.io/final_project/
In Tableau, we created a crosstabs visualization to analyze possible relationships between year and state. Specifically in this visual, we looked at the ratio of in-state tuition to total graduate students across the states. First, two data sets from our data.world profile needed to be joined. The university data had the number of graduate students and the 2010 Data_Update2 sheet had the in-state tuition information for each state. We completed an inner join on state of these two data sets in order to get the in-state tuition and total undergraduate students for each state. Across the x-axis of the cross tabs in year and along the y-axis is state. In each of the cells is the total number of undergraduates for that year and state. The KPI is the in-state tuition over the total number of undergraduates. The table is colored by KPI: the cells in orange are the low KPI, red is medium KPI, and blue is high KPI. This visual is interesting because it can be seen how the tuition for in-state students can be broken down across the average university population in that state. It really shows how high in-state tuition can be.
In-State Tuition vs. Total Undergraduates
The shiny app was also created to mimic the Tableau visual. In order to join the data via the server.R, a SQL query was written that joined the two datasets, calculated the KPI, and separated KPI values into low, medium, and high. The KPI was calculated was the same as in the Tableau visual. The range values for low, medium, and high were set in the ui.R code to match the values set in Tableau.
In-State Tuition vs. Total Undergraduates
This first barchart displays the percentage of people working toward a bachelors degree per U.S. state. To create this plot, two datasets were connected using an inner join on each of the states. The first dataset, acsInfo, contained the number of bachelors students per state, and the second dataset, censusdata, containted each state’s total population. Both of these datasets were obtained from the U.S. Census Bureau’s 2010 census data. A calculated field was created to determine the percentage of those studying toward a bachelors degree in each state by dividing bachelors students by state population. Plotting states on the X axis and the calculated field on the Y axis created the final barchart. An average line was used to show where the average percentage of bachelors students falls across all states, which was 12.603%.
Bachelors Students per State
The above Tableau barchart was also created in Shiny by using ggplot. This Shiny app also displays the percentage of people working toward a bachelors degree in each U.S. state. This plot also contains an average line showing the average percentage of bachelors students across all states, with it’s value displayed in the side panel to the left of the barchart.
Bachelors Students per State
This next graph depicts the average percentage of students of each demographic category across all the institutions in each US state and territory. Demographic categories include White, Black, Hispanic, Asian, and other. The data is displayed as a stacked bar graph and was retrieved from “Clean_merged_university_data.csv”. In Tableau, this plot was made by dragging “State” into Rows and dragging the measure values for each demographic into Columns. To make this graph in Shiny, the plotly package was used to make a stacked bar graph. Using this package also offers the benefits of interactivity when hovering over columns.
The following histogram was created in Tableau to show how many institutions fall within a certain range of in-state tuition on a yearly basis. The X axis represents bins in increments of $5000 for in-state tuition, and the Y axis represents the number of institutions that fall within each bin. Pages were also added to this visualization such that it could be observed how the number of schools within each bin changes from year to year. As tuition prices increased each year, a visible shift could be seen towards the right of the graph, and the 2003 graph’s right skew became less noticeable.
In-State Tuition Distribution
A similar histogram was also created in Shiny using ggplot and plotly. This histogram also shows how many instututions fall within a certain range of in-state tuition prices, but this histogram includes all data from 2003 to 2013. Hovering over each bin in the app will display each bin’s count. This histogram also includes a reference line representing the average in-state tuition price across all institutions.
In-State Tuition Distribution
This scatterplot tests the correlation of obesity to education level across all 50 states using the R packages ggplot and Shiny. The plot displays the percentages of obese people per U.S. state against the number of state citizens possessing a college degree over total state population. The plot reveals a very clear negative correlation between education level and obesity rate by state and is emphasized by the addition of a regression line. The data was amassed from 3 data.world tables that were double joined to make the needed query. The Shiny code (server.R and ui.R) allows for easy sharing of the visualization and the plotly R package imparts hovering functionality to make the graphic more interactive.
Obesity
This Tableau graphic recapitulates the scatterplot correlating state obesity rate to education level. The data for the plot was collected via a web SQL query utilizing a double INNER JOIN (see associated image below) allowing the construction of the two measures: prcntObesity against the calculated field(100*(bachelors/population)). The plot was then filtered on state and a trend-line was added. As mentioned previously, a negative association is observed between state education level and obesity rate.
Obesity
To build the above scatterplot, a web connection to our data.world repository was established and a double INNER JOIN between the adult_obese, censusdata and acsinfo tables was made that joined on the state variable. It is a graphical representation of the following SQL query: SELECT O.prcntObese, C.State_Population, O.Abbrv, A.bachelors, A.State FROM adult_obese as O JOIN censusdata as C on O.Abbrv = C.State_Avg JOIN acsInfo as A on A.State = O.Abbrv;
Obesity Join
Using Shiny Actions, we also created a series of scatterplots displaying the average undergraduate SAT score for U.S.universities and grouped them together into categories based on what U.S. state or territitory the school is located . Each state runs along the X axis, and scores run along the Y axis. The scatterplot is also color coded by state with a legend to the right of the plot. Each university is represented by a single point, and each state’s average SAT score is represented by a black diamond. By clicking and dragging a selection box across a subset of the points in the scatterplot, a new scatterplot is created that displays only those points selected. This visualization was generated with the ggplot2 R package using its geom_point() and stat_summary() functions.
SAT Scores
This is a box plot that displays the average SAT scores for each school broken down by state. It was created using data from the ‘Clean-Merged-University” data set. We got the data on tableau from using the web connector and connecting through data.world. For the shiny app, we ran a query through data.world. This graphic is interesting because you can see how SAT scores vary across states. In tableau this graphic was created by putting state onto the Columns section, and Average SAT Score onto the rows section. Year was dragged onto color to see which year each score came from. In shiny, it was created using the plot and geom_boxplot functions.